Part 1 - Project Based

DOMAIN: Automobile


CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

DATA DESCRIPTION: The data concerns city-cycle fuel consumption in miles per gallon.

Attribute Information:
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

PROJECT OBJECTIVE: The goal is to cluster the data and treat each cluster as an individual dataset to train regression models to predict 'mpg'.

1. Import and warehouse data:
• Import all the given datasets and explore shape and size.
• Merge all datasets onto one and explore final shape and size.
• Export the final dataset and store it on local machine in .csv, .xlsx and .json format for future use.
• Import the data from above steps into python.

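The import/merge/export steps above can be sketched as follows. The column split and file names here are toy stand-ins for the datasets actually provided, not the real files:

```python
import pandas as pd

# Hypothetical stand-ins for the provided files; in practice these would be
# read with pd.read_csv(...) from the given datasets.
part_a = pd.DataFrame({"car name": ["ford pinto", "saab 99le"],
                       "mpg": [25.0, 24.0], "cyl": [4, 4]})
part_b = pd.DataFrame({"car name": ["ford pinto", "saab 99le"],
                       "hp": ["75", "115"], "wt": [2542, 2694]})

print(part_a.shape, part_b.shape)        # explore shape and size of each piece

# Merge onto one final dataset using the common key.
final = part_a.merge(part_b, on="car name")
print(final.shape)

# Persist locally in the requested formats.
final.to_csv("car_data.csv", index=False)
final.to_json("car_data.json", orient="records")
# final.to_excel("car_data.xlsx", index=False)   # needs openpyxl installed

# Re-import the stored data for the next steps.
df = pd.read_csv("car_data.csv")
```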
2. Data cleansing:
• Missing/incorrect value treatment
• Drop attribute/s if required using relevant functional knowledge
• Perform other kinds of corrections/treatment on the data.

    Missing/incorrect value treatment.

    We can see that:

    1. There are no null values in the dataset.
2. The "hp" column has datatype object; we will have to check this column.

We can see there are '?' values in the column, which is why its datatype is object.

    Drop attribute/s if required using relevant functional knowledge
We can drop these rows, since there are only a few of them and we will lose only a small part of the dataset.

Perform other kinds of corrections/treatment on the data.
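The treatment described above can be sketched as follows; the toy frame is illustrative, not the real data:

```python
import numpy as np
import pandas as pd

# Toy frame reproducing the issue: 'hp' is read as object because of '?'.
df = pd.DataFrame({"hp": ["130", "?", "150", "?"],
                   "mpg": [18.0, 25.0, 16.0, 31.0]})

# Replace the '?' placeholder with NaN, then cast the column to numeric.
df["hp"] = pd.to_numeric(df["hp"].replace("?", np.nan))

# Option chosen in this report: drop the few affected rows.
df = df.dropna(subset=["hp"])
print(df.dtypes["hp"], len(df))
```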

    3. Data analysis & visualisation: [ Score: 4 points ]
    • Perform detailed statistical analysis on the data.
    • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

    Perform detailed statistical analysis on the data.

    We can see that:

1. There is a high standard deviation for the columns "disp" and "wt".
2. There may be potential outliers in the "disp", "hp" and "wt" columns, as there is a definite gap between the 75% quantile and the max values in these columns.
3. The car name "ford pinto" has the maximum occurrence in the dataset.
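The summary statistics quoted above come from pandas' `describe()`; a minimal sketch on a toy frame (the values here are illustrative, not the full dataset):

```python
import pandas as pd

# Toy numeric frame; in the project this is the merged car dataset.
df = pd.DataFrame({"disp": [97, 151, 250, 400, 455],
                   "wt":   [1835, 2672, 3282, 4354, 4951]})

stats = df.describe()
print(stats)

# The outlier hint used above: a large gap between the 75% quantile and max.
gap = stats.loc["max"] - stats.loc["75%"]
print(gap)
```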

    Univariate Analysis

    Numerical Columns

mpg

• The distribution is relatively normal with slight positive skewness.
• There are no outliers in this column.
• Most of the records have mpg values less than 30.

disp

• The distribution is not normal; there are three peaks in the distribution plot, i.e. there may possibly be 3 clusters.
• There are no outliers in this column.
• 50% of the records have values less than 151.

hp

• The distribution is not normal; the data has positive skewness.
• There are 10 outliers in this column.
• 75% of the records have values less than 126.
• There are 2 peaks in the distribution plot.

wt

• The distribution is relatively normal with positive skewness.
• There are no outliers in this column.
• 75% of the records have values less than 3600.

acc

• The distribution is normal.
• There are 11 outliers in this column.
Categorical Columns

origin

• Origin is highly imbalanced.
• Origin with value 1 has the maximum count in the dataset, followed by 3 and then 2, as shown by the above graphs.
• Categories 2 and 3 have nearly similar distributions.

cyl

• The column is highly imbalanced.
• Most of the records have 4 cylinders.
• 4 alone accounts for 50.77% of the data, whereas 8 and 6 are in nearly the same proportion.
• 3 and 5 collectively account for only 7 entries, i.e. 1.8% of the entire data.

yr

• Most of the records are car models from the 70s.
• The data is relatively balanced for this column.

mpg_level

• Most of the cars are in the medium mpg_level zone.

We can notice that the car_name column has a company name as a prefix, so it may be fruitful to extract it as a separate feature and analyse it.

• The distribution of car_company is not uniform.
• Most of the proportion is covered by the top 15 car companies.
• Ford and Chevrolet alone comprise around 23% (almost a quarter).
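The company extraction described above can be sketched as follows (the car names here are a toy sample; in the project this runs on the full car_name column):

```python
import pandas as pd

# The manufacturer is the first token of car_name, per the EDA observation above.
df = pd.DataFrame({"car_name": ["ford pinto", "chevrolet impala",
                                "ford torino", "toyota corolla"]})

# Split on whitespace and keep the first word as the company feature.
df["car_company"] = df["car_name"].str.split().str[0]
counts = df["car_company"].value_counts()
print(counts)
```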
Bivariate Analysis

• In the starting years, manufacturing is completely dominated by origin 1.
• As the years progress, origins 2 and 3 start manufacturing more vehicles.
• Origin 2 manufactures more vehicles than origin 3 at first, but origin 3 exceeds it after '76.
• As the years progress, vehicles with more cylinders (8 and 6) decrease significantly.
• As the years progress, vehicles with fewer cylinders increase.
• Throughout the years, vehicles with 4 cylinders keep a significant proportion.
• These results make sense: as the years progress and technology advances, vehicles with low mpg and more cylinders lose focus, and vehicles with high mpg and fewer cylinders are the new stars.
Multivariate Analysis

    Insights

1. There could possibly be 3*3 = 9 clusters in this data.
2. As mpg increases, displacement, horsepower & weight decrease but acceleration increases.
3. As horsepower increases, displacement & weight increase but acceleration decreases.
4. As weight increases, displacement increases but acceleration decreases.
5. As acceleration increases, displacement decreases.

We can see there is very high correlation between wt, disp, hp, cyl and mpg, just as we would expect.

    4. Machine learning:
    • Use K Means and Hierarchical clustering to find out the optimal number of clusters in the data.
    • Share your insights about the difference in using these two methods.

Hierarchical Clustering

The full dendrogram appears to be too much visual clutter, so we will cut it down to give us 2 clusters/groups.

This clearly shows two distinct groups, with a difference in the averages of the variables between the clusters.

K Means Clustering

This also clearly shows two distinct groups, with a difference in the averages of the variables between the clusters.

    Linear regression on the original dataset

    Linear regression on data with K means clustering

    Linear regression on data with Hierarchial Clustering

• K-means appears to explain the highest variation in the dataset, but with a difference of only 3% compared with the other models; to get more clarity, a larger dataset may be used.
• When the data is separated into clusters, the prediction accuracy increases significantly.
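The cluster-then-regress comparison above can be sketched like this, using synthetic data with two different linear regimes standing in for the clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Synthetic stand-in: two groups with different linear relations to the target.
X1 = rng.normal(0, 1, (60, 3)); y1 = X1 @ [1.0, -2.0, 0.5] + rng.normal(0, 0.1, 60)
X2 = rng.normal(6, 1, (60, 3)); y2 = X2 @ [-1.0, 0.5, 2.0] + rng.normal(0, 0.1, 60)
X, y = np.vstack([X1, X2]), np.concatenate([y1, y2])

# Baseline: one linear regression on the whole dataset.
base_r2 = LinearRegression().fit(X, y).score(X, y)

# Cluster first, then fit one regression per cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
cluster_r2 = [
    LinearRegression()
    .fit(X[labels == c], y[labels == c])
    .score(X[labels == c], y[labels == c])
    for c in (0, 1)
]
print(base_r2, cluster_r2)
```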

6. Improvisation: Detailed suggestions or improvements on quality, quantity, variety, velocity, veracity etc. of the data points collected by the company, to perform a better data analysis in future.
• The reason the cars are being used would give information about the cars too; the dataset does not capture that information.
• The units are not mentioned for all the columns; having them would be beneficial too.
• Manufacturer information, like the country of origin, should be given.
• No information relevant to the business need is provided.

    Part 2 - Project Based

    DOMAIN: Manufacturing


    CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.

    DATA DESCRIPTION: The data concerns the chemical composition of the wine and its respective quality.
    Attribute Information:

    1. A, B, C, D: specific chemical composition measure of the wine
    2. Quality: quality of wine [ Low and High ]

PROJECT OBJECTIVE: The goal is to build a synthetic data generation model using the existing data provided by the company.
1. Design a synthetic data generation model which can impute values [Attribute: Quality] wherever the company has missed recording the data.

There are 18 rows where the company has missed recording the quality of the wine sample.

There appears to be no misclassification when checking the predicted clusters against the non-missing target variable; hence the new labels can be used as the target variable for the data.
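A minimal sketch of the imputation idea above: cluster on the chemistry features, map each cluster to the majority recorded quality, and fill the missing rows. The data here is synthetic, not the company's:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the wine data: 4 chemistry features (A-D) and a
# two-level quality label with some entries missing.
X, true_q = make_blobs(n_samples=200, centers=2, n_features=4, random_state=0)
quality = true_q.astype(object)
missing = np.zeros(len(quality), bool)
missing[:18] = True
quality[missing] = None                      # the 18 unrecorded rows

# Cluster on the chemistry alone, then map each cluster to the majority
# recorded label inside it.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
mapping = {}
for c in (0, 1):
    known = [q for q, m, k in zip(quality, missing, clusters) if not m and k == c]
    mapping[c] = max(set(known), key=known.count)

quality[missing] = [mapping[c] for c in clusters[missing]]
print(sum(q is None for q in quality))       # all rows now labelled
```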

    Part 3 - Project Based

    DOMAIN: Automobile


    CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette.
    The vehicle may be viewed from one of many different angles.

    DATA DESCRIPTION: The data contains features extracted from the silhouette of vehicles in different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, Cheverolet van, Saab 9000 and an Opel Manta 400 cars. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.
    All the features are numeric i.e. geometric features extracted from the silhouette.

    PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the model using just the raw data.
    1. Data: Import, clean and pre-process the data

• There are some null values in almost all of the columns; we will have to impute or delete them.
• As mentioned, all the columns contain numeric datatypes only.
• For scatter_ratio, scaled_variance and scaled_variance.1, the mean and median are far from each other, i.e. the data is not normally distributed.

• We can see from the summary that there are outliers in the data, thus we will do a boxplot analysis to find the outliers.

• scaled_variance.1 has a very high standard deviation; we will look at the spread of its data.
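A sketch of the null treatment, assuming median imputation is preferred over deletion (the frame and values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the issue: scattered nulls in numeric columns.
df = pd.DataFrame({"scatter_ratio": [140.0, np.nan, 160.0, 210.0],
                   "scaled_variance.1": [300.0, 640.0, np.nan, 990.0]})

print(df.isnull().sum())                 # count nulls per column

# Median imputation is a reasonable choice here because several columns
# are skewed / contain outliers (assumption: deletion would lose rows).
df = df.fillna(df.median())
print(df.isnull().sum().sum())           # 0 after imputation
```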

2. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.

    Univariate Analysis

• From the visualization we can see that there are outliers in the data, and the counts of outliers have also been displayed. (We will work on the outliers later.)

• Columns have data distributed across multiple scales.

• Several columns have distributions that are not unimodal (e.g. distance_circularity, hollows_ratio, elongatedness).

• Columns skewness_about and skewness_about.1 have data that is right skewed, whereas for column skewness_about.2 the data is nearly normally distributed.

• Some columns have a long right tail (e.g. pr.axis_aspect_ratio).

Bivariate Analysis

• As we can see, there is a significant difference between classes when comparing the mean and median of all the numeric attributes.

Multivariate Analysis

    Most of the columns look like they have linear relationships and hence we will see the correlation of all the attributes.

• Elongatedness has visibly high correlation with almost all columns.

• Many columns are correlated, so we will choose which columns are of importance to us.

3. Classifier: Design and train a best fit SVM classifier using all the data attributes.
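A hedged sketch of a "best fit" SVM via a small grid search; `load_iris` stands in for the vehicle-silhouette data, which is not bundled here:

```python
from sklearn.datasets import load_iris   # stand-in for the silhouette data
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Scale first (SVMs are distance based), then search the usual SVM knobs.
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10],
                           "svc__kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X_tr, y_tr)
acc = grid.score(X_te, y_te)
print(grid.best_params_, acc)
```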

    4. Dimensional reduction: perform dimensional reduction on the data.

• We can see that the first six components explain more than 95% of the variation.

• The first 5 components capture more than 91% of the information.

• Since the plot shows almost 95% variance explained by the first 6 components, we can drop the 7th component onwards.
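The component-count decision above can be sketched as follows; `load_digits` stands in for the silhouette features (the 95% threshold is the one used in this report):

```python
import numpy as np
from sklearn.datasets import load_digits  # stand-in for the silhouette data
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance ratio.
pca = PCA().fit(Xs)
cum = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching the 95% threshold.
n_95 = int(np.searchsorted(cum, 0.95) + 1)
print(n_95, cum[:6])
```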

5. Classifier: Design and train a best fit SVM classifier using dimensionally reduced attributes.

    6. Conclusion: Showcase key pointer on how dimensional reduction helped in this case.

• Both models give more than 90% accuracy on the test data.

• The PCA model used only 6 components to reach 90%+ accuracy, whereas the model without PCA used all the variables to reach 90%+ accuracy.

Part 4 - Project Based

    DOMAIN: Sports management

    CONTEXT: Company X is a sports management company for international cricket.

DATA DESCRIPTION: The data collected belongs to batsmen from the IPL series conducted so far. Attribute Information:
    1. Runs: Runs score by the batsman
    2. Ave: Average runs scored by the batsman per match
    3. SR: strike rate of the batsman
    4. Fours: number of boundary/four scored
    5. Six: number of boundary/six scored
    6. HF: number of half centuries scored so far

PROJECT OBJECTIVE: The goal is to build a data driven batsman ranking model for the sports management company to make business decisions.

    1. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.

  • There are no null values in the dataset.

  • All the columns have expected datatypes.

  • All the distributions are unimodal.

  • All variables except Strike rate are positively skewed.

• There are outliers in the data, but we will not be treating them as it is highly likely that these are genuine observations.

We can observe that all the variables have high correlation.

2. Model Building: Build a data driven model to rank all the players in the dataset using all or the most important performance features.
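One simple way to build such a ranking model is sketched below: min-max scale the performance features so no single scale dominates, then average them into a score. The players and numbers are illustrative, and this is one possible design rather than the report's exact model:

```python
import pandas as pd

# Toy batsman table (names and numbers are illustrative, not real IPL stats).
df = pd.DataFrame({"Player": ["A", "B", "C"],
                   "Runs": [400, 250, 520], "Ave": [40.0, 25.0, 43.0],
                   "SR": [130.0, 145.0, 125.0], "Fours": [35, 20, 50],
                   "Six": [15, 22, 12], "HF": [3, 1, 4]})

feats = ["Runs", "Ave", "SR", "Fours", "Six", "HF"]
# Min-max scale each feature to [0, 1], then average into a composite score.
scaled = (df[feats] - df[feats].min()) / (df[feats].max() - df[feats].min())
df["score"] = scaled.mean(axis=1)
df["rank"] = df["score"].rank(ascending=False).astype(int)
print(df[["Player", "score", "rank"]].sort_values("rank"))
```

An equal-weights average is the simplest choice; domain knowledge (or PCA loadings) could instead weight the more important features.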

    Part 5 - Question Based

    1. List down all possible dimensionality reduction techniques that can be implemented using python.
Dimensionality Reduction Techniques can be classified into 3 types:
    Feature selection:
    a) Missing Value Ratio: If the dataset has too many missing values, we use this approach to reduce the number of variables. We can drop the variables having a large number of missing values in them.
    b) Low Variance filter: We apply this approach to identify and drop constant variables from the dataset. The target variable is not unduly affected by variables with low variance, and hence these variables can be safely dropped.
    c) High Correlation filter: A pair of variables having high correlation increases multicollinearity in the dataset. So, we can use this technique to find highly correlated features and drop them accordingly.
    d) Random Forest: This is one of the most commonly used techniques which tells us the importance of each feature present in the dataset. We can find the importance of each feature and keep the top most features, resulting in dimensionality reduction.
e) Backward Feature Elimination and Forward Feature Selection: both techniques take a lot of computational time and are thus generally used on smaller datasets.

    Components / Factor Based:
    a) Factor Analysis: This technique is best suited for situations where we have highly correlated set of variables. It divides the variables based on their correlation into different groups, and represents each group with a factor.
    b) Principal Component Analysis: This is one of the most widely used techniques for dealing with linear data. It divides the data into a set of components which try to explain as much variance as possible.
c) Independent Component Analysis: We can use ICA to transform the data into independent components which describe it using fewer components.

    Projection Based:
    a) ISOMAP: We use this technique when the data is strongly non-linear.
    b) t-SNE: This technique also works well when the data is strongly non-linear. It works extremely well for visualizations as well.
    c) UMAP: This technique works well for high dimensional data. Its run-time is shorter as compared to t-SNE.
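Two of the feature-selection filters above (low variance and high correlation) can be sketched together on a toy frame; the thresholds 1e-3 and 0.95 are illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame: x3 is near-constant, x2 is almost a copy of x1.
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = df["x1"] + rng.normal(scale=0.01, size=100)
df["x3"] = 1.0 + rng.normal(scale=1e-6, size=100)
df["x4"] = rng.normal(size=100)

# Low variance filter: drop near-constant columns.
keep = df.columns[df.var() > 1e-3]
df = df[keep]

# High correlation filter: drop one of each pair with |r| > 0.95.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, bool), k=1))
drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=drop)
print(list(df.columns))
```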

2. So far you have used dimensional reduction on numeric data. Is it possible to do the same on multimedia data and text data? Please illustrate your findings using a simple implementation in python.
Yes, dimensionality reduction techniques can work for multimedia data as well as text data.

Exploring Handwritten Digits

    The images data is a three-dimensional array: 1,797 samples, each consisting of an 8×8 grid of pixels.

    1,797 samples and 64 features.

Another way to gain intuition into the characteristics of the model is to plot the inputs again with their predicted labels. We'll use green for correct labels and red for incorrect labels.
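A minimal sketch of dimensionality reduction on image (multimedia-like) data, using Isomap on the digits set described above; a scatter plot of the projection, coloured by digit, would show the classes separating:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

digits = load_digits()
print(digits.images.shape)    # (1797, 8, 8): 8x8 pixel grids, i.e. image data
X = digits.data               # flattened to (1797, 64)

# Project the 64 pixel features down to 2 dimensions for visualization,
# e.g. plt.scatter(proj[:, 0], proj[:, 1], c=digits.target).
iso = Isomap(n_components=2)
proj = iso.fit_transform(X)
print(proj.shape)
```

The same idea carries to text: vectorize documents (e.g. TF-IDF) into a numeric matrix first, then apply any of the techniques listed in question 1.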